Boston University Institutional Repository (OpenBU)

Seeing the Forest for the Trees: Using the Gene Ontology to Restructure Hierarchical Clustering

Author: Dotan-Cohen Dikla
Kasif Simon
Melkman Avraham A.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 03/06/2009

Motivation: There is a growing interest in improving the cluster analysis of expression data by incorporating into it prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity. Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. Funding: Lynne and William Frankel Center for Computer Science; Paul Ivanier center for robotics research and production; National Institutes of Health (R01 HG003367-01A1).

To Supplement or Not to Supplement: A Metabolic Network Framework for Human Nutritional Supplements

Author: Kasif Simon
Nogiec Christopher D.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 05/08/2013

Flux balance analysis and constraint-based modeling have been successfully used in the past to elucidate the metabolism of single-celled organisms. However, limited work has been done with multicellular organisms and even less with humans. The focus of this paper is to present a novel use of this technique by investigating human nutrition, a challenging field of study. Specifically, we present a steady-state constraint-based model of skeletal muscle tissue to investigate amino acid supplementation's effect on protein synthesis. We implement several in silico supplementation strategies to study whether amino acid supplementation might be beneficial for increasing muscle contractile protein synthesis. Consistent with published data on amino acid supplementation's effect on protein synthesis in a post-resistance-exercise state, our results suggest that increasing bioavailability of methionine, arginine, and the branched-chain amino acids can increase the flux of contractile protein synthesis. The study also suggests that a common commercial supplement, glutamine, is not an effective supplement in the context of increasing protein synthesis and thus, muscle mass. Similar to any study in a model organism, the computational modeling of this research has some limitations. Thus, this paper introduces the prospect of using systems biology as a framework to formally investigate how supplementation and nutrition can affect human metabolism and physiology
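
The steady-state reasoning behind this kind of model can be illustrated with a deliberately tiny sketch. It assumes a hypothetical network with a single synthesis reaction that consumes amino acids in fixed proportions; for a network that small, the flux balance linear program (maximize synthesis flux subject to steady state and uptake bounds) has a closed-form solution, namely the most limiting ratio of uptake bound to demand. The function name, amino acid demands, and uptake bounds below are invented for illustration and are not taken from the paper's skeletal muscle model.

```python
def max_synthesis_flux(uptake_bounds, demand):
    """Maximum steady-state synthesis flux for a one-reaction toy network.

    uptake_bounds: upper bound on the uptake flux of each amino acid.
    demand: units of each amino acid consumed per unit of protein made.
    At steady state uptake_i = demand_i * v, so v <= bound_i / demand_i
    for every amino acid, and the optimum is the smallest such ratio.
    """
    return min(uptake_bounds[aa] / demand[aa] for aa in demand)


# Hypothetical numbers, chosen so that methionine is the limiting input.
demand = {"leucine": 2.0, "methionine": 1.0, "glutamine": 1.0}
baseline = {"leucine": 4.0, "methionine": 1.0, "glutamine": 10.0}

base_flux = max_synthesis_flux(baseline, demand)
more_glutamine = max_synthesis_flux({**baseline, "glutamine": 20.0}, demand)
more_methionine = max_synthesis_flux({**baseline, "methionine": 2.0}, demand)

print(base_flux, more_glutamine, more_methionine)
```

In this toy setting, raising the bound on a non-limiting input leaves the optimal flux unchanged, while raising the bound on the limiting input increases it. That is only the mechanism by which a constraint-based model can distinguish supplements; which amino acids are limiting in real tissue depends on the full network and measured bounds.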

Harvard University - DASH

FigShare

GEMS: a web server for biclustering analysis of expression data

Author: Kasif Simon
Wu Chang-Jiun
Publication venue: Oxford University Press
Publication date: 01/01/2005

The advent of microarray technology has revolutionized the search for genes that are differentially expressed across a range of cell types or experimental conditions. Traditional clustering methods, such as hierarchical clustering, are often difficult to deploy effectively since genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering of gene expression data (also called co-clustering or two-way clustering) is a non-trivial but promising methodology for the identification of gene groups that show a coherent expression profile across a subset of conditions. Thus, biclustering is a natural methodology as a screen for genes that are functionally related, participate in the same pathways, are affected by the same drug or pathological condition, or form modules that are potentially co-regulated by a small group of transcription factors. We have developed a web-enabled service called GEMS (Gene Expression Mining Server) for biclustering microarray data. Users may upload expression data and specify a set of criteria. GEMS then performs bicluster mining based on a Gibbs sampling paradigm. The web server provides a flexible and useful platform for the discovery of co-expressed and potentially co-regulated gene modules. GEMS is open source software and is available at

Boston University Institutional Repository (OpenBU)

Crossref

Public Library of Science (PLOS)

Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data

Author: Kasif Simon
Kolaczyk Eric D.
Nariai Naoki
Publication venue: Public Library of Science
Publication date: 01/03/2007

Dramatic improvements in high-throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function, compared to proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross-validation study we show that the integrated model increases recall by 18% at 50% precision, compared to using PPI data alone. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement vs. PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function
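
The neighborhood assumption stated in this abstract (graph neighbors are more likely to share function) can be sketched with a simple neighbor-voting score. This is a simplification for illustration, not the probabilistic model the paper proposes: it scores a candidate term by the fraction of a protein's annotated neighbors that carry it. The function name, protein identifiers, and term labels are hypothetical.

```python
def neighbor_vote(graph, annotations, protein, term):
    """Fraction of a protein's annotated neighbors that carry `term`.

    graph: dict mapping each protein to the set of its linked proteins.
    annotations: dict mapping annotated proteins to their set of terms.
    Returns 0.0 when the protein has no annotated neighbors.
    """
    annotated = [n for n in graph.get(protein, set()) if n in annotations]
    if not annotated:
        return 0.0
    return sum(term in annotations[n] for n in annotated) / len(annotated)


# Hypothetical functional linkage graph; P1 and P5 have no known function.
graph = {
    "P1": {"P2", "P3", "P4", "P5"},
    "P2": {"P1"},
    "P3": {"P1"},
    "P4": {"P1"},
    "P5": {"P1"},
}
annotations = {
    "P2": {"term_A"},
    "P3": {"term_A", "term_B"},
    "P4": {"term_B"},
}

score_a = neighbor_vote(graph, annotations, "P1", "term_A")
print(round(score_a, 3))
```

Unannotated neighbors (P5 here) are ignored rather than counted as negative evidence, which is one of several possible design choices. A probabilistic treatment would also weigh edge reliability and the background frequency of each term.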

Boston University Institutional Repository (OpenBU)

Genes involved in complex adaptive processes tend to have highly conserved upstream regions in mammalian genomes

Author: Kasif Simon
Kohane Isaac
Lee Soohyun
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Recent advances in genome sequencing suggest a remarkable conservation in gene content of mammalian organisms. The similarity in gene repertoire present in different organisms has increased interest in studying regulatory mechanisms of gene expression aimed at elucidating the differences in phenotypes. In particular, a proximal promoter region contains a large number of regulatory elements that control the expression of its downstream gene. Although many studies have focused on identification of these elements, a broader picture on the complexity of transcriptional regulation of different biological processes has not been addressed in mammals. The regulatory complexity may strongly correlate with gene function, as different evolutionary forces must act on the regulatory systems under different biological conditions. We investigate this hypothesis by comparing the conservation of promoters upstream of genes classified in different functional categories. RESULTS: By conducting a rank correlation analysis between functional annotation and upstream sequence alignment scores obtained by human-mouse and human-dog comparison, we found a significantly greater conservation of the upstream sequence of genes involved in development, cell communication, neural functions and signaling processes than those involved in more basic processes shared with unicellular organisms such as metabolism and ribosomal function. This observation persists after controlling for G+C content. Considering conservation as a functional signature, we hypothesize a higher density of cis-regulatory elements upstream of genes participating in complex and adaptive processes. CONCLUSION: We identified a class of functions that are associated with either high or low promoter conservation in mammals. 
We detected a significant tendency for complex and adaptive processes to be associated with higher promoter conservation, despite the fact that they have emerged relatively recently during evolution. We described and contrasted several hypotheses that provide a deeper insight into how transcriptional complexity might have emerged during evolution
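
A rank correlation between category membership and conservation score, as described in the results above, can be sketched in a few lines. The abstract does not say which rank statistic was used, so Spearman's coefficient with average ranks for ties is an assumption here, and the membership indicators and alignment scores are invented for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: 1 = gene belongs to the functional category.
in_category = [1, 1, 0, 0, 1, 0]
upstream_score = [0.9, 0.8, 0.3, 0.4, 0.7, 0.2]

rho = spearman(in_category, upstream_score)
print(round(rho, 3))
```

A positive coefficient means genes in the category tend to have higher upstream alignment scores than genes outside it. Assessing significance, and controlling for confounders such as G+C content, would require additional steps not shown here.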

Harvard University - DASH

Springer - Publisher Connector

Boston University Institutional Repository (OpenBU)

Genomic functional annotation using co-evolution profiles of gene clusters

Author: Kasif Simon
Roberts Richard J
Zheng Yu
Publication venue: BioMed Central
Publication date: 10/10/2002

BACKGROUND: The current speed of sequencing already exceeds the capability of annotation, creating a potential bottleneck. A large proportion of the genes in microbial genomes remains uncharacterized. Here we propose a new method for functional annotation using the conservation patterns of gene clusters. If several gene clusters show the same coevolution pattern across different genomes it is reasonable to infer they are functionally related. The gene cluster phylogenetic profile integrates chromosomal proximity information and phylogenetic profile information and allows us to infer functional dependences between the gene clusters even at great distance on the chromosome. RESULTS: As a proof of concept, we applied our method to the genome of Escherichia coli K12 strain. Our method establishes functional relationships among 176 gene clusters, comprising 738 E. coli genes. The accuracy of pair phylogenetic profiles was compared with the single-gene phylogenetic profile and was shown to be higher. As a result, we are able to suggest functional roles for several previously unknown genes or unknown genomic regions in E. coli. We also examined the robustness of coevolution signals across a larger set of genomes and suggest a possible upper limit of accuracy for the phylogenetic profile methods. CONCLUSIONS: The higher-order phylogenetic profiles, such as the gene-pair phylogenetic profiles, can detect functional dependences that are missed by using conventional single-gene phylogenetic profile or the chromosomal proximity method only. We show that the gene-pair phylogenetic profile is more accurate than the single-gene phylogenetic profiles
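
The idea of a gene cluster phylogenetic profile can be sketched as follows. The sketch assumes binary presence/absence profiles across a fixed list of genomes, treats a cluster as present in a genome only when all of its member genes are present (ignoring chromosomal proximity within that genome), and compares two cluster profiles by the fraction of genomes on which they agree. These definitions, the function names, and the gene names are illustrative assumptions; the abstract does not specify how profiles are built or scored.

```python
def cluster_profile(gene_profiles, cluster):
    """Presence/absence profile of a gene cluster across genomes.

    A cluster counts as present in a genome only if every member gene
    is present there (an assumption made for this sketch).
    """
    member_profiles = [gene_profiles[gene] for gene in cluster]
    return tuple(int(all(bits)) for bits in zip(*member_profiles))


def profile_agreement(p, q):
    """Fraction of genomes in which two profiles have the same value."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


# Hypothetical single-gene profiles over five genomes (1 = present).
gene_profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 1),
    "geneC": (1, 1, 0, 1, 0),
    "geneD": (1, 0, 0, 1, 0),
}

profile_ab = cluster_profile(gene_profiles, ["geneA", "geneB"])
profile_cd = cluster_profile(gene_profiles, ["geneC", "geneD"])
agreement = profile_agreement(profile_ab, profile_cd)
print(profile_ab, profile_cd, agreement)
```

Two clusters whose profiles agree across many genomes are candidates for a functional relationship even when they lie far apart on the chromosome. A measure such as mutual information could replace the simple agreement fraction used here.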

Boston University Institutional Repository (OpenBU)

Computational tradeoffs in multiplex PCR assay design for SNP genotyping

Author: Cantor Charles
Ding Chunming
Kasif Simon
Rachlin John
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Multiplex PCR is a key technology for detecting infectious microorganisms, whole-genome sequencing, forensic analysis, and for enabling flexible yet low-cost genotyping. However, the design of multiplex PCR assays requires the consideration of multiple competing objectives and physical constraints, and extensive computational analysis must be performed in order to identify the possible formation of primer-dimers that can negatively impact product yield. RESULTS: This paper examines the computational design limits of multiplex PCR in the context of SNP genotyping and examines tradeoffs associated with several key design factors including multiplexing level (the number of primer pairs per tube), coverage (the percentage of SNPs whose associated primers are actually assigned to one of several available tubes), and tube-size uniformity. We also examine how design performance depends on the total number of available SNPs from which to choose, and primer stringency criteria. We show that finding high-multiplexing/high-coverage designs is subject to a computational phase transition, becoming dramatically more difficult when the probability of primer pair interaction exceeds a critical threshold. The precise location of this critical transition point depends on the number of available SNPs and the level of multiplexing required. We also demonstrate how coverage performance is impacted by the number of available SNPs, primer selection criteria, and target multiplexing levels. CONCLUSION: The presence of a phase transition suggests limits to scaling multiplex PCR performance for high-throughput genomics applications. Achieving broad SNP coverage rapidly transitions from being very easy to very hard as the target multiplexing level (number of primer pairs per tube) increases. 
The onset of a phase transition can be "delayed" by having a larger pool of SNPs, or loosening primer selection constraints so as to increase the number of candidate primer pairs per SNP, though the latter may produce other adverse effects. The resulting design performance tradeoffs define a benchmark that can serve as the basis for comparing competing multiplex PCR design optimization algorithms and can also provide general rules-of-thumb to experimentalists seeking to understand the performance limits of standard multiplex PCR
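
The design factors named in this abstract (multiplexing level, coverage, and primer-pair interactions) can be made concrete with a small sketch. It uses a first-fit greedy assignment, which is an assumption chosen for brevity and not the algorithm studied in the paper; the SNP labels and the set of interacting pairs are hypothetical.

```python
def assign_to_tubes(snps, interacts, n_tubes, tube_size):
    """First-fit greedy assignment of SNP primer pairs to tubes.

    A SNP goes into the first tube that still has room and contains no
    SNP whose primers interact with its own. SNPs that fit nowhere are
    left unassigned, which lowers coverage.
    """
    tubes = [[] for _ in range(n_tubes)]
    for snp in snps:
        for tube in tubes:
            has_room = len(tube) < tube_size
            if has_room and not any(interacts(snp, other) for other in tube):
                tube.append(snp)
                break
    return tubes


def coverage(tubes, snps):
    """Fraction of SNPs that were assigned to some tube."""
    return sum(len(tube) for tube in tubes) / len(snps)


# Hypothetical primer-pair interactions (unordered pairs of SNP labels).
conflicts = {
    frozenset(("s1", "s2")),
    frozenset(("s1", "s3")),
    frozenset(("s2", "s3")),
}


def interacts(a, b):
    return frozenset((a, b)) in conflicts


snps = ["s1", "s2", "s3", "s4"]
tubes = assign_to_tubes(snps, interacts, n_tubes=2, tube_size=2)
print(tubes, coverage(tubes, snps))
```

Here three mutually interacting SNPs cannot share two tubes, so one is dropped and coverage falls below 100%. Raising the interaction probability or the multiplexing level makes such conflicts more frequent, which is the kind of tradeoff the abstract describes.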

Springer - Publisher Connector

MuPlex: multi-objective multiplex PCR assay design

Author: Cantor Charles
Ding Chunming
Kasif Simon
Rachlin John
Publication venue: Oxford University Press
Publication date: 01/01/2005

We have developed a web-enabled system called MuPlex that aids researchers in the design of multiplex PCR assays. Multiplex PCR is a key technology for an endless list of applications, including detecting infectious microorganisms, whole-genome sequencing and closure, forensic analysis and for enabling flexible yet low-cost genotyping. However, the design of multiplex PCR assays is computationally challenging because it involves tradeoffs among competing objectives, and extensive computational analysis is required in order to screen out primer-pair cross interactions. With MuPlex, users specify a set of DNA sequences along with primer selection criteria, interaction parameters and the target multiplexing level. MuPlex then designs a set of multiplex PCR assays that cover as many of the input sequences as possible. MuPlex provides multiple solution alternatives that reveal tradeoffs among competing objectives. MuPlex is uniquely designed for large-scale multiplex PCR assay design in an automated high-throughput environment, where high coverage of potentially thousands of single nucleotide polymorphisms is required. The server is available at

Boston University Institutional Repository (OpenBU)

Crossref